net.htmlparser.jericho.ParseText java code examples

@Override
public boolean atEndOfAttributes(final Source source, final int pos, final boolean isClosingSlashIgnored) {
  final ParseText parseText = source.getParseText();
  return parseText.charAt(pos) == '>' || (parseText.containsAt("/>", pos));
}

  protected int getEnd(final Source source, final int pos) {
    // This method needs to be overridden because this tag type shares the same start delimiter as the downlevel hidden conditional comment.
    // The closing delimiter of the other tag type must not appear inside this tag.
    // Take the following example:
    // <!--[if IE]> ... <![endif]--> ... <!--[if !(IE 5)]><!--> ... <!--<![endif]-->
    // If the default implementation were used, then the parser would recognise the first tag as:
    // <!--[if IE]> ... <![endif]--> ... <!--[if !(IE 5)]><!-->
    final int delimiterBegin=source.getParseText().indexOf(MicrosoftConditionalCommentTagTypes.DOWNLEVEL_HIDDEN_IF.getClosingDelimiter(),pos);
    if (delimiterBegin==-1) return -1;
    if (source.getParseText().containsAt(getClosingDelimiter(),delimiterBegin)) return delimiterBegin+getClosingDelimiter().length();
    // this is a downlevel hidden conditional comment, so fail this tag type silently without displaying a log message
    return -2;
  }
}

private static CharacterReference getPrevious(final Source source, int pos, final Config.UnterminatedCharacterReferenceSettings unterminatedCharacterReferenceSettings) {
  final ParseText parseText=source.getParseText();
  pos=parseText.lastIndexOf('&',pos);
  while (pos!=-1) {
    final CharacterReference characterReference=construct(source,pos,unterminatedCharacterReferenceSettings);
    if (characterReference!=null) return characterReference;
    pos=parseText.lastIndexOf('&',pos-1);
  }
  return null;
}

boolean isInQuotes = false;
boolean isInApos = false;
for (int x = pos; x < text.length(); x++) {
  char c = text.charAt(x);
  switch (c) {
  case '>':
    if (!isInQuotes && !isInApos) {
      if (x > 2 && text.subSequence(x - 3, x).equals("---")) {

private static List<Segment> addURLSegmentsFromCSS(final List<Segment> uriSegments, final Segment cssSegment) {
  final Source source=cssSegment.getSource();
  final ParseText parseText=source.getParseText();
  final int breakAtIndex=cssSegment.getEnd();
  for (int pos=cssSegment.getBegin(); (pos=parseText.indexOf("url(",pos,breakAtIndex))!=-1;) {
    pos+=4;
    while (pos<breakAtIndex && Segment.isWhiteSpace(parseText.charAt(pos))) pos++;
    if (pos>=breakAtIndex) break;
    if (isQuote(parseText.charAt(pos))) {
      pos++;
      if (pos>=breakAtIndex) break;
    }
    final int uriBegin=pos;
    final int closingBracketPos=parseText.indexOf(')',uriBegin,breakAtIndex);
    if (closingBracketPos==-1) break;
    pos=closingBracketPos;
    while (Segment.isWhiteSpace(parseText.charAt(pos-1))) pos--;
    if (isQuote(parseText.charAt(pos-1))) pos--;
    final int uriEnd=pos;
    if (uriEnd<=uriBegin) break;
    uriSegments.add(new Segment(source,uriBegin,uriEnd));
    pos=closingBracketPos;
  }
  return uriSegments;
}

public ProspectiveTagTypeIterator(final Source source, final int pos) {
  // returns empty iterator if pos out of range
  final ParseText parseText=source.getParseText();
  cursor=root;
  int posIndex=0;
  try {
    // find deepest node that matches the text at pos:
    while (true) {
      final TagTypeRegister child=cursor.getChild(parseText.charAt(pos+(posIndex++)));
      if (child==null) break;
      cursor=child;
    }
  } catch (IndexOutOfBoundsException ex) {} // not avoiding this exception is expensive but only happens in the very rare circumstance that the end of file is encountered in the middle of a potential tag.
  // go back up until we reach a node that contains a list of tag types:
  while (cursor.tagTypes==null) if ((cursor=cursor.parent)==null) break;
}

private static CharacterReference getNext(final Source source, int pos, final Config.UnterminatedCharacterReferenceSettings unterminatedCharacterReferenceSettings) {
  final ParseText parseText=source.getParseText();
  pos=parseText.indexOf('&',pos);
  while (pos!=-1) {
    final CharacterReference characterReference=construct(source,pos,unterminatedCharacterReferenceSettings);
    if (characterReference!=null) return characterReference;
    pos=parseText.indexOf('&',pos+1);
  }
  return null;
}

if (startTag.name.equals(searchName)) return startTag;
if (startTag.name.startsWith(searchName) && startTag.isPartialNameSearchMatch(searchName)) return startTag;
if (startTag.name.length()<searchName.length() && source.getParseText().containsAt(searchName,startTag.begin+searchStartTagType.startDelimiterPrefix.length())) return startTag;
startTag=(StartTag)startTag.getPreviousTag(searchStartTagType);
int begin=pos;
do {
  begin=parseText.lastIndexOf(startDelimiter,begin);
  if (begin==-1) return null;
  final StartTag startTag=(StartTag)Tag.getTagAt(source,begin,false);

/**
 * Indicates whether the specified source document position is at the end of a tag's {@linkplain Attributes attributes}.
 * <br />(<a href="TagType.html#DefaultImplementation">default implementation</a> method)
 * <p>
 * This method is called internally while parsing {@linkplain Attributes attributes} to detect where they should end.
 * <p>
 * It can be assumed that the specified position is not inside a quoted attribute value.
 * <p>
 * The default implementation simply compares the {@linkplain ParseText parse text} at the specified
 * position with the {@linkplain #getClosingDelimiter() closing delimiter}, and is equivalent to:<br />
 * <code>source.</code>{@link Source#getParseText() getParseText()}<code>.containsAt(</code>{@link #getClosingDelimiter() getClosingDelimiter()}<code>,pos)</code>
 * <p>
 * The <code>isClosingSlashIgnored</code> parameter is only relevant in the {@link #NORMAL} start tag type,
 * which makes use of it to cater for the '<code>/</code>' character that can occur before the 
 * {@linkplain #getClosingDelimiter() closing delimiter} in {@linkplain StartTag#isEmptyElementTag() empty-element tags}.
 * Its value is always <code>false</code> when passed to other start tag types.
 *
 * @param source  the {@link Source} document.
 * @param pos  the character position in the source document.
 * @param isClosingSlashIgnored  indicates whether the {@linkplain StartTag#getName() name} of the {@linkplain StartTag start tag} being tested is incompatible with an {@linkplain StartTag#isEmptyElementTag() empty-element tag}.
 * @return <code>true</code> if the specified source document position is at the end of a tag's {@linkplain Attributes attributes}, otherwise <code>false</code>.
 */
public boolean atEndOfAttributes(final Source source, final int pos, final boolean isClosingSlashIgnored) {
  return source.getParseText().containsAt(getClosingDelimiter(),pos);
}
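
The two variants of this check (the default shown above and the NORMAL start tag variant that honours isClosingSlashIgnored) can be tried out without the library. The following is only a sketch on a plain String: the class EndOfAttributesSketch and its helper methods are hypothetical stand-ins, not Jericho API, and containsAt is approximated with String.startsWith(prefix, offset).

```java
public class EndOfAttributesSketch {
    // Hypothetical stand-in for ParseText.containsAt, using a plain String.
    static boolean containsAt(String text, String s, int pos) {
        return text.startsWith(s, pos);
    }

    // Mirrors the default check: is the closing delimiter located at pos?
    static boolean atEndDefault(String parseText, String closingDelimiter, int pos) {
        return containsAt(parseText, closingDelimiter, pos);
    }

    // Mirrors the NORMAL start tag check: '>' always ends the attributes,
    // "/>" ends them only when the closing slash is not ignored.
    static boolean atEndNormal(String parseText, int pos, boolean isClosingSlashIgnored) {
        return parseText.charAt(pos) == '>'
            || (!isClosingSlashIgnored && containsAt(parseText, "/>", pos));
    }

    public static void main(String[] args) {
        String text = "<br class=x/>"; // '/' is at index 11, '>' at index 12
        System.out.println(atEndNormal(text, 11, false));
        System.out.println(atEndNormal(text, 11, true));
        System.out.println(atEndDefault(text, ">", 12));
    }
}
```

With the slash ignored, position 11 is not treated as the end of the attributes, so a caller would keep scanning until it reaches the '>' at position 12.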

static CharacterReference construct(final Source source, final int begin, final Config.UnterminatedCharacterReferenceSettings unterminatedCharacterReferenceSettings) {
  try {
    if (source.getParseText().charAt(begin)!='&') return null;
    return (source.getParseText().charAt(begin+1)=='#')
      ? NumericCharacterReference.construct(source,begin,unterminatedCharacterReferenceSettings)
      : CharacterEntityReference.construct(source,begin,unterminatedCharacterReferenceSettings.characterEntityReferenceMaxCodePoint);
  } catch (IndexOutOfBoundsException ex) {
    return null;
  }
}

private static boolean isXML(final Segment firstNonTextSegment) {
  if (firstNonTextSegment==null || !(firstNonTextSegment instanceof Tag)) return false;
  Tag tag=(Tag)firstNonTextSegment;
  if (tag.getTagType()==StartTagType.XML_DECLARATION) return true;
  // if document has a DOCTYPE declaration and it contains the text "xhtml", it is an XML document:
  if (tag.source.getParseText().indexOf("xhtml",tag.begin,tag.end)!=-1) return true;
  return false;
}

if (this==EndTagType.NORMAL && source.getParseText().containsAt("</script",pos)) {

  public boolean atEndOfAttributes(final Source source, final int pos, final boolean isClosingSlashIgnored) {
    final ParseText parseText=source.getParseText();
    return parseText.charAt(pos)=='>' || (!isClosingSlashIgnored && parseText.containsAt("/>",pos));
  }
}

  protected int getEnd(final Source source, int pos) {
    final ParseText parseText=source.getParseText();
    boolean insideQuotes=false;
    do {
      final char c=parseText.charAt(pos);
      if (c=='"') {
        insideQuotes=!insideQuotes;
      } else if (c=='>' && !insideQuotes) {
        return pos+1;
      }
    } while ((++pos)<source.getEnd());
    return -1;
  }
}
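
The quote-aware scan in the snippet above can be exercised on a plain String. This is a minimal sketch under the assumption that only double quotes toggle the quoted state, as in the snippet; QuoteAwareEndSketch is a hypothetical class name and the method takes a String instead of a Source.

```java
public class QuoteAwareEndSketch {
    // Returns the index just after the first '>' that is not inside double quotes,
    // scanning from pos, or -1 if no such '>' exists.
    static int getEnd(String parseText, int pos) {
        boolean insideQuotes = false;
        for (int i = pos; i < parseText.length(); i++) {
            char c = parseText.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            } else if (c == '>' && !insideQuotes) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The '>' at index 14 sits inside quotes, so the tag ends after index 17.
        System.out.println(getEnd("<!doctype x \"a>b\"> rest", 0)); // prints 18
    }
}
```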

if (startTag.name.equals(searchName)) return startTag;
if (startTag.name.startsWith(searchName) && startTag.isPartialNameSearchMatch(searchName)) return startTag;
if (startTag.name.length()<searchName.length() && source.getParseText().containsAt(searchName,startTag.begin+searchStartTagType.startDelimiterPrefix.length())) return startTag;
startTag=(StartTag)startTag.getNextTag(searchStartTagType);
int begin=pos;
do {
  begin=parseText.indexOf(startDelimiter,begin);
  if (begin==-1) return null;
  final StartTag startTag=(StartTag)Tag.getTagAt(source,begin,false);

  /**
   * Returns the {@linkplain Tag#getEnd() end} of a tag of this type, starting from the specified position in the specified source document.
   * <br />(<a href="TagType.html#ImplementationAssistance">implementation assistance</a> method)
   * <p>
   * This default implementation simply searches for the first occurrence of the
   * {@linkplain #getClosingDelimiter() closing delimiter} after the specified position, and returns the position immediately
   * after the end of it.
   * <p>
   * If the closing delimiter is not found, the value <code>-1</code> is returned.
   *
   * @param source  the {@link Source} document.
   * @param pos  the position in the source document.
   * @return the {@linkplain Tag#getEnd() end} of a tag of this type, starting from the specified position in the specified source document, or <code>-1</code> if the end of the tag can not be found.
   */
  protected int getEnd(final Source source, final int pos) {
    final int delimiterBegin=source.getParseText().indexOf(getClosingDelimiter(),pos);
    return (delimiterBegin==-1 ? -1 : delimiterBegin+getClosingDelimiter().length());
  }
}
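
The default behaviour described in the Javadoc above (find the closing delimiter, return the position just past it, or -1) can be sketched on a plain String. DefaultEndSketch is a hypothetical class used only for illustration; it takes the closing delimiter as a parameter rather than reading it from a tag type.

```java
public class DefaultEndSketch {
    // Returns the position immediately after the first occurrence of closingDelimiter
    // at or after pos, or -1 if the delimiter is not found.
    static int getEnd(String parseText, String closingDelimiter, int pos) {
        int delimiterBegin = parseText.indexOf(closingDelimiter, pos);
        return delimiterBegin == -1 ? -1 : delimiterBegin + closingDelimiter.length();
    }

    public static void main(String[] args) {
        // "-->" starts at index 10, so the tag ends at index 13.
        System.out.println(getEnd("<!-- note --> text", "-->", 4)); // prints 13
        System.out.println(getEnd("<!-- note", "-->", 4));          // prints -1
    }
}
```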

static final Tag getPreviousTagUncached(final Source source, final int pos, final int breakAtPos) {
  // returns null if pos is out of range.
  try {
    final ParseText parseText=source.getParseText();
    int begin=pos;
    do {
      begin=parseText.lastIndexOf('<',begin,breakAtPos); // this assumes that all tags start with '<'
      // parseText.lastIndexOf and indexOf return -1 if pos is out of range.
      if (begin==-1) return null;
      final Tag tag=getTagAt(source,begin,false);
      if (tag!=null && tag.includeInSearch()) return tag;
    } while ((begin-=1)>=0);
  } catch (IndexOutOfBoundsException ex) {
    throw new AssertionError("Unexpected internal exception");
  }
  return null;
}

if (isStatic()) {
  name=getNamePrefix();
  if (!parseText.containsAt(getClosingDelimiter(),startDelimiterEnd)) {
    if (source.logger.isErrorEnabled()) source.logger.error(source.getRowColumnVector(pos).appendTo(new StringBuilder(200).append("EndTag of expected format ").append(staticString).append(" at ")).append(" not recognised as type '").append(getDescription()).append("' because it is missing the closing delimiter").toString());
    return null;
  name=source.getName(nameBegin,nameEnd);
  int expectedClosingDelimiterPos=nameEnd;
  while (Segment.isWhiteSpace(parseText.charAt(expectedClosingDelimiterPos))) expectedClosingDelimiterPos++;
  if (!parseText.containsAt(getClosingDelimiter(),expectedClosingDelimiterPos)) {
    if (source.logger.isErrorEnabled()) source.logger.error(source.getRowColumnVector(pos).appendTo(new StringBuilder(200).append("EndTag ").append(name).append(" at ")).append(" not recognised as type '").append(getDescription()).append("' because its name and closing delimiter are separated by characters other than white space").toString());
    return null;

  protected int getEnd(final Source source, int pos) {
    final ParseText parseText=source.getParseText();
    boolean insideQuotes=false;
    boolean insideSquareBrackets=false;
    do {
      final char c=parseText.charAt(pos);
      if (insideQuotes) {
        if (c=='"') insideQuotes=false;
      } else {
        switch (c) {
          case '>':
            if (!insideSquareBrackets) return pos+1;
            break;
          case '"':
            insideQuotes=true;
            break;
          case '[':
            insideSquareBrackets=true;
            break;
          case ']':
            insideSquareBrackets=false;
            break;
        }
      }
    } while ((++pos)<source.getEnd());
    return -1;
  }
}

  protected Tag constructTagAt(final Source source, final int pos) {
    final int closingDelimiterPos=source.getParseText().indexOf('>',pos+1);
    if (closingDelimiterPos==-1) return null;
    final Tag tag=constructStartTag(source,pos,closingDelimiterPos+1,"",null);
    if (source.logger.isErrorEnabled()) source.logger.error(source.getRowColumnVector(tag.getBegin()).appendTo(new StringBuilder(200).append("Encountered possible StartTag at ")).append(" whose content does not match a registered StartTagType").toString());
    return tag;
  }
}

Javadoc

Represents the text from the Source document that is to be parsed.

This interface is normally only of interest to users who wish to create custom tag types.

The parse text is defined as the entire text of the source document in lower case, with all ignored segments (see Segment#ignoreWhenParsing()) replaced by space characters.
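
As a rough illustration of that definition, the sketch below lower-cases a string character by character (so positions stay aligned with the source) and blanks out a set of ignored ranges. It is not the library's implementation: ParseTextSketch, toParseText and the int[][] range representation are hypothetical names chosen for this example.

```java
public class ParseTextSketch {
    // Hypothetical helper, not part of Jericho: lower-cases each character and
    // overwrites every half-open range [begin, end) in ignoredRanges with spaces.
    static String toParseText(String sourceText, int[][] ignoredRanges) {
        char[] chars = sourceText.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        for (int[] range : ignoredRanges) {
            for (int i = range[0]; i < range[1]; i++) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Assume the range [3, 10) covering "<% x %>" was marked to be ignored.
        String parseText = toParseText("<P><% x %>Hi</P>", new int[][] {{3, 10}});
        System.out.println("[" + parseText + "]");
    }
}
```

Because the result has the same length as the input, an index found in the parse text can be used directly as a position in the original document.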

The text is stored in lower case to make case insensitive parsing as efficient as possible.

This interface provides many methods which are also provided by the java.lang.String class, but adds an extra parameter called breakAtIndex to the various indexOf methods. This parameter allows a search on only a specified segment of the text, which is not possible using the normal String class.
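
The idea of a bounded search can be sketched with plain String operations. BoundedSearchSketch and its indexOf helper are hypothetical, and the sketch assumes that a match must lie entirely before the break index; the library's exact boundary handling is not shown in the snippets above.

```java
public class BoundedSearchSketch {
    // Hypothetical stand-in for a bounded search: returns the first index of target
    // at or after fromIndex whose match ends at or before breakAtIndex, else -1.
    static int indexOf(String text, String target, int fromIndex, int breakAtIndex) {
        int end = Math.max(0, Math.min(breakAtIndex, text.length()));
        return text.substring(0, end).indexOf(target, fromIndex);
    }

    public static void main(String[] args) {
        String css = "a{background:url(x.png)} b{background:url(y.png)}";
        int firstRuleEnd = css.indexOf('}') + 1; // 24
        System.out.println(indexOf(css, "url(", 0, firstRuleEnd));   // prints 13
        System.out.println(indexOf(css, "url(", 14, firstRuleEnd));  // prints -1
        System.out.println(indexOf(css, "url(", 14, css.length()));  // prints 38
    }
}
```

This is the pattern used by the addURLSegmentsFromCSS snippet earlier on the page, where the end of the CSS segment is passed as the break index so the search never runs into the rest of the document.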

ParseText instances are obtained using the Source#getParseText() method.

Most used methods

charAt
Returns the character at the specified index.
containsAt
Indicates whether this parse text contains the specified string at the specified position. This meth
indexOf
Returns the index within this parse text of the first occurrence of the specified string, starting t
lastIndexOf
Returns the index within this parse text of the last occurrence of the specified string, searching b
length
Returns the length of the parse text.
subSequence
Returns a new character sequence that is a subsequence of this sequence.

How to use ParseText in net.htmlparser.jericho

Best Java code snippets using net.htmlparser.jericho.ParseText (Showing top 20 results out of 315)